Expose native Lance scan descriptor for datafusion-comet integration by wirybeaver · Pull Request #624 · lance-format/lance-spark

wirybeaver · 2026-06-12T08:39:48Z

Closes #623.

Summary

This adds a stable native-read descriptor for ordinary Lance Spark scans so native engines can consume Spark-planned Lance reads without depending on Lance Spark internals.

The descriptor captures:

Dataset URI and resolved dataset version.
Spark read schema JSON and projected read schema JSON.
Projected columns, pushed filter SQL, limit/offset, batch size, and storage options.
Per-partition native splits with Lance fragment IDs.
Explicit fallback reasons when a scan cannot be represented by the minimal v1 descriptor.
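
To make the list above concrete, here is a minimal sketch of what such a descriptor could look like as an immutable Java value. Every class and field name here (NativeScanDescriptorSketch, Descriptor, NativeSplit, and so on) is an assumption for illustration; the PR text describes the contents of the descriptor but not its exact types, so the real API may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: class and field names are assumptions, not the PR's API.
final class NativeScanDescriptorSketch {

  /** One Spark partition's native split: the Lance fragment IDs it should read. */
  static final class NativeSplit {
    final List<Integer> fragmentIds;

    NativeSplit(List<Integer> fragmentIds) {
      this.fragmentIds = Collections.unmodifiableList(new ArrayList<>(fragmentIds));
    }
  }

  /** Immutable snapshot of an already-planned scan. */
  static final class Descriptor {
    final String datasetUri;
    final long resolvedVersion;              // pinned at planning time
    final String sparkReadSchemaJson;
    final String projectedReadSchemaJson;
    final List<String> projectedColumns;
    final String pushedFilterSql;            // null when no filter was pushed
    final Integer limit;                     // null when absent
    final Integer offset;                    // null when absent
    final int batchSize;
    final Map<String, String> storageOptions;
    final List<NativeSplit> splits;
    final List<String> fallbackReasons;      // empty means v1 can represent the scan

    Descriptor(String datasetUri, long resolvedVersion,
               String sparkReadSchemaJson, String projectedReadSchemaJson,
               List<String> projectedColumns, String pushedFilterSql,
               Integer limit, Integer offset, int batchSize,
               Map<String, String> storageOptions,
               List<NativeSplit> splits, List<String> fallbackReasons) {
      this.datasetUri = datasetUri;
      this.resolvedVersion = resolvedVersion;
      this.sparkReadSchemaJson = sparkReadSchemaJson;
      this.projectedReadSchemaJson = projectedReadSchemaJson;
      // Defensive copies so the snapshot cannot change after planning.
      this.projectedColumns = Collections.unmodifiableList(new ArrayList<>(projectedColumns));
      this.pushedFilterSql = pushedFilterSql;
      this.limit = limit;
      this.offset = offset;
      this.batchSize = batchSize;
      this.storageOptions = Collections.unmodifiableMap(new LinkedHashMap<>(storageOptions));
      this.splits = Collections.unmodifiableList(new ArrayList<>(splits));
      this.fallbackReasons = Collections.unmodifiableList(new ArrayList<>(fallbackReasons));
    }

    /** A consumer should only attempt a native read when no fallback reason is recorded. */
    boolean isNativeReadable() {
      return fallbackReasons.isEmpty();
    }
  }
}
```

The defensive copies reflect the intent stated in this PR that the descriptor is a resolved snapshot: once planning has finished, nothing in it should change.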

The v1 scope is ordinary table reads only. Search and hybrid search, index-backed execution descriptors, aggregation pushdown, metadata/version columns, and namespace-backed credential refresh are not described natively in v1; they remain fallback cases or future work.

Why LanceScan carries the descriptor state

This PR adds several parameters to the LanceScan constructor because LanceScan is the object Spark keeps inside BatchScanExec after planning. A native consumer such as Comet sees that final BatchScanExec(scan = LanceScan) object, not the earlier LanceScanBuilder or catalog planning context. Therefore nativeScanPlan() needs the complete, already-resolved scan snapshot on LanceScan itself.

The goal is not to make the constructor a broad public API. The goal is to avoid asking native consumers to infer or recompute Lance Spark planning semantics from partial state. Re-planning later would be risky because it could reopen a different dataset version, use different storage options, produce different fragments, or miss fallback-only state such as pushed TopN/aggregation.
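
The consumer-side flow described above can be sketched as follows. This is not Comet's or Lance Spark's actual code: Scan and LanceScanStub are stand-ins for Spark's Scan interface and for LanceScan, Plan is a hypothetical return type, and only the method name nativeScanPlan() comes from this PR. The point is that the consumer reads the pinned URI and version off the scan object it is handed and never re-plans.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: these types are stand-ins, not Spark, Comet, or Lance Spark classes.
final class NativeConsumerSketch {

  /** Stand-in for Spark's DataSource V2 Scan interface. */
  interface Scan {}

  /** Hypothetical descriptor view: pinned dataset identity plus any fallback reasons. */
  static final class Plan {
    final String datasetUri;
    final long resolvedVersion;
    final List<String> fallbackReasons;

    Plan(String datasetUri, long resolvedVersion, List<String> fallbackReasons) {
      this.datasetUri = datasetUri;
      this.resolvedVersion = resolvedVersion;
      this.fallbackReasons = Collections.unmodifiableList(new ArrayList<>(fallbackReasons));
    }
  }

  /** Stand-in for LanceScan; it already holds the fully resolved snapshot. */
  static final class LanceScanStub implements Scan {
    private final Plan plan;

    LanceScanStub(Plan plan) {
      this.plan = plan;
    }

    Plan nativeScanPlan() {
      return plan;
    }
  }

  /** Decide the read path from the final scan object alone, without re-planning. */
  static String chooseReadPath(Scan scan) {
    if (!(scan instanceof LanceScanStub)) {
      return "spark"; // not a Lance scan: leave it to Spark
    }
    Plan plan = ((LanceScanStub) scan).nativeScanPlan();
    if (!plan.fallbackReasons.isEmpty()) {
      return "spark"; // the descriptor rejected the scan explicitly
    }
    // Use exactly the version resolved on the driver, never "latest".
    return "native:" + plan.datasetUri + "@v" + plan.resolvedVersion;
  }
}
```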

The added state is the minimum needed to describe or reject the native v1 scan accurately:

sparkReadSchema and schema keep the Spark-visible schema and projected read schema separate, which matters when Spark-facing fields differ from the physical/projection schema.
readOptions provides dataset URI, resolved version, batch size, table/catalog identifiers, and user storage options. The resolved version is required so native execution cannot drift to a newer Lance snapshot.
whereConditions, limit, and offset are serialized into the descriptor when v1 supports them.
topNSortOrders and pushedAggregation are carried even though v1 falls back for them, because the descriptor must reject those scans explicitly instead of silently dropping semantics.
pushedPredicates, zonemap stats, surviving fragment IDs, precomputed splits, and fragment row counts preserve the fragment-pruning and limit-pruning decisions Lance Spark already made on the driver.
activeShardingExpression and fragmentShardingKeys preserve the existing partitioning/reporting contract used by Lance Spark planning.
initialStorageOptions, namespaceImpl, and namespaceProperties preserve storage option precedence and the namespace context that workers/native readers need for the same dataset access path.
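
As a rough illustration of why fallback-only state such as topNSortOrders and pushedAggregation has to be carried, the sketch below derives explicit fallback reasons from a simplified scan state. The ScanState class, its boolean flags, and the reason strings are all hypothetical; the PR only states that such scans must be rejected explicitly rather than having their semantics silently dropped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: ScanState and the reason strings are assumptions, not the PR's API.
final class FallbackReasonsSketch {

  /** Simplified view of the planning state that v1 cannot represent natively. */
  static final class ScanState {
    final boolean hasTopNSortOrders;
    final boolean hasPushedAggregation;
    final boolean readsMetadataColumns;

    ScanState(boolean hasTopNSortOrders, boolean hasPushedAggregation,
              boolean readsMetadataColumns) {
      this.hasTopNSortOrders = hasTopNSortOrders;
      this.hasPushedAggregation = hasPushedAggregation;
      this.readsMetadataColumns = readsMetadataColumns;
    }
  }

  /** Collect every reason the scan is outside v1 scope; an empty list means it is supported. */
  static List<String> fallbackReasons(ScanState state) {
    List<String> reasons = new ArrayList<>();
    if (state.hasTopNSortOrders) {
      reasons.add("pushed TopN is not supported by the v1 descriptor");
    }
    if (state.hasPushedAggregation) {
      reasons.add("pushed aggregation is not supported by the v1 descriptor");
    }
    if (state.readsMetadataColumns) {
      reasons.add("metadata/version columns are not supported by the v1 descriptor");
    }
    return Collections.unmodifiableList(reasons);
  }
}
```

Collecting all reasons instead of stopping at the first one is a design choice of this sketch; it makes the rejection easier to diagnose when several unsupported features are pushed down at once.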

A follow-up cleanup could wrap this constructor state into an internal immutable scan-state object if reviewers prefer that shape. This PR keeps the change direct so the descriptor contract and tests are easy to review first.

Testing

./mvnw -pl lance-spark-base_2.12 -Dtest=LanceScanTest -Dspotless.skip=true test
./mvnw -pl lance-spark-4.1_2.13 -am -Dtest=LanceScanTest -Dspotless.skip=true -Dsurefire.failIfNoSpecifiedTests=false test
./mvnw -pl lance-spark-base_2.12 spotless:check

github-actions · 2026-06-12T08:40:06Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

wirybeaver added 2 commits June 12, 2026 08:08

feat: expose native Lance scan descriptor

e72785e

fix: preserve native read storage option precedence

0888e58

This was referenced Jun 12, 2026

Add optional native Lance scan support apache/datafusion-comet#4633

Draft

Add optional native Lance scan support apache/datafusion-comet#4632

Open

Expose a stable native scan descriptor for Lance Spark reads and datafusion-comet integration #623

Open

wirybeaver changed the title ~~Expose native Lance scan descriptor~~ Expose native Lance scan descriptor for datafusion-comet integration Jun 12, 2026

wirybeaver marked this pull request as draft June 12, 2026 15:50

sezruby mentioned this pull request Jun 16, 2026

Lance-spark support in Gluten apache/gluten#12263

Open

wirybeaver wants to merge 2 commits into lance-format:main from wirybeaver:xuanyili/native-lance-read-descriptor
